Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche

نویسندگان

  • Hadrien Gelas
  • Solomon Teferra Abate
  • Laurent Besacier
  • François Pellegrino
چکیده

Performance analysis of sub-word language modeling for under-resourced languages with rich morphology : case study on Swahili and Amharic This paper investigates the impact on ASR performance of sub-word units for two underresourced african languages with rich morphology (Amharic and Swahili). Two subword units are considered : syllable and morpheme, the latter being obtained in a supervised or unsupervised way. The important issue of word reconstruction from the syllable (or morpheme) ASR output is also discussed. For both languages, best results are reached with morphemes got from unsupervised approach. It leads to very significant WER reduction for Amharic ASR for which LM training data is very small (2.3M words) and it also slightly reduces WER over a Word-LM baseline for Swahili ASR (28M words for LM training). A detailed analysis of the OOV word reconstruction is also presented ; it is shown that a high percentage (up to 75% for Amharic) of OOV words can be recovered with morph-based language model and appropriate reconstruction method. MOTS-CLÉS : Modèle de langage, Morphème, Hors vocabulaire, Langues peu-dotées.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]

Performance analysis of sub-word language modeling for under-resourced languages with rich morphology : case study on Swahili and Amharic This paper investigates the impact on ASR performance of sub-word units for two underresourced african languages with rich morphology (Amharic and Swahili). Two subword units are considered : syllable and morpheme, the latter being obtained in a supervised or...

متن کامل

A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-Resourced Languages

______________________________________________________________________________________________ Word Sense Disambiguation (WSD), the process of automatically identifying the meaning of a polysemous word in a sentence, is a fundamental task in Natural Language Processing (NLP). Progress in this approach to WSD opens up many promising developments in the field of NLP and its applications. Indeed, ...

متن کامل

External Lexical Information for Multilingual Part-of-Speech Tagging

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performances of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similar...

متن کامل

Multilingual Compound Splitting (Segmentation Multilingue des Mots Composés) [in French]

Résumé La composition est un phénomène fréquent dans plusieurs langues, surtout dans des langues ayant une morphologie riche. Le traitement des mots composés est un défi pour les systèmes de TAL car pour la plupart, ils ne sont pas présents dans les lexiques. Dans cet article, nous présentons une méthode de segmentation des composés qui combine des caractéristiques indépendantes de la langue (m...

متن کامل

G-OWL : Vers un langage de modélisation graphique, polymorphique et typé pour la construction d'une ontologie dans la notation OWL

Résumé : Le Web Ontology Language (OWL) standardisé par le W3C a pour objectif d’offrir un langage de conception d’ontologies pour le web sémantique. L’ingénierie d’une ontologie est une activité complexe nécessitant une habilité peu accessible à des experts de contenu. En revanche, pour modéliser du contenu métier, la modélisation graphique semi-formelle est une technique souvent employée pour...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012